1 Motivation

Michael Blum tweeted about the STOIC2021 - COVID-19 AI challenge. The main goal of this challenge is to predict from the patients’ CT scans who will develop severe disease.

Given my recent interest in machine learning, this challenge peaked my interest. Although Python is the machine learning lingua franca, it is possible to train a convolutional neural network (CNN) in R and perform (binary) image classification.

Here, I will use an R interface to Keras that allows training neural networks.

First things first, load the packages we will need.

library(tidyverse)
theme_set(theme_light())
library(keras)

2 Read in and process data

We will need a function to process images, I’m stealing that one written by Spencer Palladino.

process_pix <- function(lsf) {
  img <- lapply(lsf, image_load, grayscale = TRUE) # grayscale the image
  arr <- lapply(img, image_to_array) # turns it into an array
  arr_resized <- lapply(arr, image_array_resize, 
                        height = 100, 
                        width = 100) # resize
  arr_normalized <- normalize(arr_resized, axis = 1) #normalize to make small numbers 
  return(arr_normalized)
}

Process pix for patients with Covid, and reshape. Idem for pix for patients without Covid.

# with covid
lsf <- list.files("dat/COVID/", full.names = TRUE) 
covid <- process_pix(lsf)
covid <- covid[,,,1] # get rid of last dim
covid_reshaped <- array_reshape(covid, c(nrow(covid), 100*100))
# without covid
lsf <- list.files("dat/non-COVID/", full.names = TRUE) 
ncovid <- process_pix(lsf)
ncovid <- ncovid[,,,1] # get rid of last dim
ncovid_reshaped <- array_reshape(ncovid, c(nrow(ncovid), 100*100))

We have 1252 CT scans of patients with Covid, and 1229 without.

Let’s visualise these scans. Let’s pick a patient with Covid, and another one without.

scancovid <- reshape2::melt(covid[10,,])
plotcovid <- scancovid %>%
  ggplot() +
  aes(x = Var1, y = Var2, fill = value) + 
  geom_raster() +
  labs(x = NULL, y = NULL, title = "CT scan of a patient with covid") + 
  scale_fill_viridis_c() + 
  theme(legend.position = "none")

scanncovid <- reshape2::melt(ncovid[10,,])
plotncovid <- scanncovid %>%
  ggplot() +
  aes(x = Var1, y = Var2, fill = value) + 
  geom_raster() +
  labs(x = NULL, y = NULL, title = "CT scan of a patient without covid") + 
  scale_fill_viridis_c() + 
  theme(legend.position = "none")

library(patchwork)
plotcovid + plotncovid

Put altogether and shuffle.

df <- rbind(cbind(covid_reshaped, 1), # 1 = covid
            cbind(ncovid_reshaped, 0)) # 0 = no covid
set.seed(1234)
shuffle <- sample(nrow(df), replace = F)
df <- df[shuffle, ]

Sounds great. We have everything we need to start training our convolutional neural network model.

3 Convolutional neural network (CNN)

Build our training and testing datasets using a 80/20 split.

set.seed(2022)
split <- sample(2, nrow(df), replace = T, prob = c(0.8, 0.2))
train <- df[split == 1,]
test <- df[split == 2,]
train_target <- df[split == 1, 10001] # label in training dataset
test_target <- df[split == 2, 10001] # label in testing dataset

Build our model. I use a single layer, but it’s easy to pipe others on top of it.

model <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu") %>% 
  layer_dropout(0.4) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(0.2) %>%
  layer_dense(units = 2, activation = 'softmax')

Compile it.

model %>%
  compile(optimizer = 'adam',
          loss = 'binary_crossentropy', 
          metrics = c('accuracy'))

We use one-hot encoding with function to_categorical(), aka dumming coding in statistics.

train_label <- to_categorical(train_target)
test_label <- to_categorical(test_target)

Now fit model.

fit_covid <- model %>%
  fit(x = train,
      y = train_label, 
      epochs = 25,
      batch_size = 512, # try also 256, 512
      verbose = 2,
      validation_split = 0.2)

Visualize performances. Not too bad. No over/under-fitting. Accuracy and loss are fine.

plot(fit_covid)

Performance on test dataset?

model %>%
  evaluate(test, test_label)
##       loss   accuracy 
## 0.04746895 0.99387753

Let’s do some predictions on the test data, and compare with ground truth.

predictedclasses <- model %>%
  predict_classes(test)
table(Prediction = predictedclasses, 
      Actual = test_target)
##           Actual
## Prediction   0   1
##          0 243   2
##          1   1 244

Pretty cool. Only a few patients are misclassified.

Save model for further use.

save_model_tf(model, "model/covidmodel") # save the model

We are happy with the results. In general however, we need to find ways to improve the performances. A few tips here with examples implemented in Keras with R there.